{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "jUbPUmtaozIP"
      },
      "source": [
        "# Utilizing existing FAQs for Question Answering\n",
        "- **Level**: Beginner\n",
        "- **Time to complete**: 15 minutes\n",
        "- **Nodes Used**: `InMemoryDocumentStore`, `EmbeddingRetriever`\n",
        "- **Goal**: Learn how to use the `EmbeddingRetriever` in a `FAQPipeline` to answer incoming questions by matching them to the most similar questions in your existing FAQ.\n",
        "\n",
        "# Overview\n",
        "While *extractive Question Answering* works on pure texts and is therefore more generalizable, there's also a common alternative that utilizes existing FAQ data.\n",
        "\n",
        "**Pros**:\n",
        "\n",
        "- Very fast at inference time\n",
        "- Utilize existing FAQ data\n",
        "- Quite good control over answers\n",
        "\n",
        "**Cons**:\n",
        "\n",
        "- Generalizability: We can only answer questions that are similar to existing ones in FAQ\n",
        "\n",
        "In some use cases, a combination of extractive QA and FAQ-style can also be an interesting option."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "collapsed": false,
        "id": "zBOtphIMozIT"
      },
      "source": [
        "\n",
        "## Preparing the Colab Environment\n",
        "\n",
        "- [Enable GPU Runtime](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration#enabling-the-gpu-in-colab)\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "ENpLjBejozIW"
      },
      "source": [
        "## Installing Haystack\n",
        "\n",
        "To start, let's install the latest release of Haystack with `pip`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "q_y78_4LozIW"
      },
      "outputs": [],
      "source": [
        "%%bash\n",
        "\n",
        "pip install --upgrade pip\n",
        "pip install farm-haystack[colab,inference]"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Enabling Telemetry \n",
        "Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product but you can always opt out by commenting the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/telemetry) for more details."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {},
      "outputs": [],
      "source": [
        "from haystack.telemetry import tutorial_running\n",
        "\n",
        "tutorial_running(4)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "collapsed": false,
        "id": "Wl9Q6E3hozIW",
        "pycharm": {
          "name": "#%% md\n"
        }
      },
      "source": [
        "Set the logging level to INFO:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "id": "Edvocv1ZozIX",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "import logging\n",
        "\n",
        "logging.basicConfig(format=\"%(levelname)s - %(name)s -  %(message)s\", level=logging.WARNING)\n",
        "logging.getLogger(\"haystack\").setLevel(logging.INFO)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "id": "noVtM20ZozIX"
      },
      "source": [
        "### Create a simple DocumentStore\n",
        "The InMemoryDocumentStore is good for quick development and prototyping. For more scalable options, check-out the [docs](https://docs.haystack.deepset.ai/docs/document_store)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "zeVfvRLZozIY"
      },
      "outputs": [],
      "source": [
        "from haystack.document_stores import InMemoryDocumentStore\n",
        "\n",
        "document_store = InMemoryDocumentStore()"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "collapsed": false,
        "id": "zHevRxxaozIa"
      },
      "source": [
        "### Create a Retriever using embeddings\n",
        "Instead of retrieving via Elasticsearch's plain BM25, we want to use vector similarity of the questions (user question vs. FAQ ones).\n",
        "We can use the `EmbeddingRetriever` for this purpose and specify a model that we use for the embeddings."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "oFNXb3kIozIb",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "from haystack.nodes import EmbeddingRetriever\n",
        "\n",
        "retriever = EmbeddingRetriever(\n",
        "    document_store=document_store,\n",
        "    embedding_model=\"sentence-transformers/all-MiniLM-L6-v2\",\n",
        "    use_gpu=True,\n",
        "    scale_score=False,\n",
        ")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "collapsed": false,
        "id": "uLv8ysluozIb"
      },
      "source": [
        "### Prepare & Index FAQ data\n",
        "We create a pandas dataframe containing some FAQ data (i.e curated pairs of question + answer) and index those in our documentstore.\n",
        "Here: We download some question-answer pairs related to COVID-19"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "AHiSltp4ozIb",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "from haystack.utils import fetch_archive_from_http\n",
        "\n",
        "\n",
        "# Download\n",
        "doc_dir = \"data/tutorial4\"\n",
        "s3_url = \"https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip\"\n",
        "fetch_archive_from_http(url=s3_url, output_dir=doc_dir)\n",
        "\n",
        "# Get dataframe with columns \"question\", \"answer\" and some custom metadata\n",
        "df = pd.read_csv(f\"{doc_dir}/small_faq_covid.csv\")\n",
        "# Minimal cleaning\n",
        "df.fillna(value=\"\", inplace=True)\n",
        "df[\"question\"] = df[\"question\"].apply(lambda x: x.strip())\n",
        "print(df.head())\n",
        "\n",
        "# Create embeddings for our questions from the FAQs\n",
        "# In contrast to most other search use cases, we don't create the embeddings here from the content of our documents,\n",
        "# but rather from the additional text field \"question\" as we want to match \"incoming question\" <-> \"stored question\".\n",
        "questions = list(df[\"question\"].values)\n",
        "df[\"embedding\"] = retriever.embed_queries(queries=questions).tolist()\n",
        "df = df.rename(columns={\"question\": \"content\"})\n",
        "\n",
        "# Convert Dataframe to list of dicts and index them in our DocumentStore\n",
        "docs_to_index = df.to_dict(orient=\"records\")\n",
        "document_store.write_documents(docs_to_index)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {
        "collapsed": false,
        "id": "MXteNgYRozIb"
      },
      "source": [
        "### Ask questions\n",
        "Initialize a Pipeline (this time without a reader) and ask questions"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 29,
      "metadata": {
        "id": "F5O7r3poozIb",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [],
      "source": [
        "from haystack.pipelines import FAQPipeline\n",
        "\n",
        "pipe = FAQPipeline(retriever=retriever)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 34,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 709,
          "referenced_widgets": [
            "070f7d6a12804647b2c4f5ec98241ced",
            "8678507de5e748219ba28bb7970c0e63",
            "35855d91133f474092381950bdbfce58",
            "0656e34a277141d184aef005e4d39f88",
            "612af309a6a94477b56dcea22c7a0940",
            "dc9c54def7bf47d39819a97b7ceed839",
            "09f4ba018a514f1ca2929ece4d0335e2",
            "58f3458cddc747d7b1a1c05f8f0664ed",
            "cd2614a0933c48a391966cb572044710",
            "52abbb2d8eb043a0924d705a99577303",
            "04495cdbd0e04e02a91ae3b026ef4c46"
          ]
        },
        "id": "QX6qbic2ozIc",
        "outputId": "af0a8eda-f7f6-4c97-cda7-13566ff888b1",
        "pycharm": {
          "name": "#%%\n"
        }
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "070f7d6a12804647b2c4f5ec98241ced",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Batches:   0%|          | 0/1 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "Query: How is the virus spreading?\n",
            "Answers:\n",
            "[   {   'answer': 'This virus was first detected in Wuhan City, Hubei '\n",
            "                  'Province, China. The first infections were linked to a live '\n",
            "                  'animal market, but the virus is now spreading from '\n",
            "                  'person-to-person. It’s important to note that '\n",
            "                  'person-to-person spread can happen on a continuum. Some '\n",
            "                  'viruses are highly contagious (like measles), while other '\n",
            "                  'viruses are less so.\\n'\n",
            "                  '\\n'\n",
            "                  'The virus that causes COVID-19 seems to be spreading easily '\n",
            "                  'and sustainably in the community (“community spread”) in '\n",
            "                  'some affected geographic areas. Community spread means '\n",
            "                  'people have been infected with the virus in an area, '\n",
            "                  'including some who are not sure how or where they became '\n",
            "                  'infected.\\n'\n",
            "                  '\\n'\n",
            "                  'Learn what is known about the spread of newly emerged '\n",
            "                  'coronaviruses.',\n",
            "        'context': 'This virus was first detected in Wuhan City, Hubei '\n",
            "                   'Province, China. The first infections were linked to a '\n",
            "                   'live animal market, but the virus is now spreading from '\n",
            "                   'person-to-person. It’s important to note that '\n",
            "                   'person-to-person spread can happen on a continuum. Some '\n",
            "                   'viruses are highly contagious (like measles), while other '\n",
            "                   'viruses are less so.\\n'\n",
            "                   '\\n'\n",
            "                   'The virus that causes COVID-19 seems to be spreading '\n",
            "                   'easily and sustainably in the community (“community '\n",
            "                   'spread”) in some affected geographic areas. Community '\n",
            "                   'spread means people have been infected with the virus in '\n",
            "                   'an area, including some who are not sure how or where they '\n",
            "                   'became infected.\\n'\n",
            "                   '\\n'\n",
            "                   'Learn what is known about the spread of newly emerged '\n",
            "                   'coronaviruses.',\n",
            "        'score': 0.9358832836151123}]\n"
          ]
        }
      ],
      "source": [
        "from haystack.utils import print_answers\n",
        "\n",
        "# Run any question and change top_k to see more or less answers\n",
        "prediction = pipe.run(query=\"How is the virus spreading?\", params={\"Retriever\": {\"top_k\": 1}})\n",
        "\n",
        "print_answers(prediction, details=\"medium\")"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "provenance": []
    },
    "gpuClass": "standard",
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.9"
    },
    "vscode": {
      "interpreter": {
        "hash": "bda33b16be7e844498c7c2d368d72665b4f1d165582b9547ed22a0249a29ca2e"
      }
    },
    "widgets": {
      "application/vnd.jupyter.widget-state+json": {
        "04495cdbd0e04e02a91ae3b026ef4c46": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "0656e34a277141d184aef005e4d39f88": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_52abbb2d8eb043a0924d705a99577303",
            "placeholder": "​",
            "style": "IPY_MODEL_04495cdbd0e04e02a91ae3b026ef4c46",
            "value": " 1/1 [00:00&lt;00:00, 12.19it/s]"
          }
        },
        "070f7d6a12804647b2c4f5ec98241ced": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HBoxModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HBoxModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HBoxView",
            "box_style": "",
            "children": [
              "IPY_MODEL_8678507de5e748219ba28bb7970c0e63",
              "IPY_MODEL_35855d91133f474092381950bdbfce58",
              "IPY_MODEL_0656e34a277141d184aef005e4d39f88"
            ],
            "layout": "IPY_MODEL_612af309a6a94477b56dcea22c7a0940"
          }
        },
        "09f4ba018a514f1ca2929ece4d0335e2": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "DescriptionStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "description_width": ""
          }
        },
        "35855d91133f474092381950bdbfce58": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "FloatProgressModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "FloatProgressModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "ProgressView",
            "bar_style": "success",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_58f3458cddc747d7b1a1c05f8f0664ed",
            "max": 1,
            "min": 0,
            "orientation": "horizontal",
            "style": "IPY_MODEL_cd2614a0933c48a391966cb572044710",
            "value": 1
          }
        },
        "52abbb2d8eb043a0924d705a99577303": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "58f3458cddc747d7b1a1c05f8f0664ed": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "612af309a6a94477b56dcea22c7a0940": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        },
        "8678507de5e748219ba28bb7970c0e63": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "HTMLModel",
          "state": {
            "_dom_classes": [],
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "HTMLModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/controls",
            "_view_module_version": "1.5.0",
            "_view_name": "HTMLView",
            "description": "",
            "description_tooltip": null,
            "layout": "IPY_MODEL_dc9c54def7bf47d39819a97b7ceed839",
            "placeholder": "​",
            "style": "IPY_MODEL_09f4ba018a514f1ca2929ece4d0335e2",
            "value": "Batches: 100%"
          }
        },
        "cd2614a0933c48a391966cb572044710": {
          "model_module": "@jupyter-widgets/controls",
          "model_module_version": "1.5.0",
          "model_name": "ProgressStyleModel",
          "state": {
            "_model_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_model_name": "ProgressStyleModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "StyleView",
            "bar_color": null,
            "description_width": ""
          }
        },
        "dc9c54def7bf47d39819a97b7ceed839": {
          "model_module": "@jupyter-widgets/base",
          "model_module_version": "1.2.0",
          "model_name": "LayoutModel",
          "state": {
            "_model_module": "@jupyter-widgets/base",
            "_model_module_version": "1.2.0",
            "_model_name": "LayoutModel",
            "_view_count": null,
            "_view_module": "@jupyter-widgets/base",
            "_view_module_version": "1.2.0",
            "_view_name": "LayoutView",
            "align_content": null,
            "align_items": null,
            "align_self": null,
            "border": null,
            "bottom": null,
            "display": null,
            "flex": null,
            "flex_flow": null,
            "grid_area": null,
            "grid_auto_columns": null,
            "grid_auto_flow": null,
            "grid_auto_rows": null,
            "grid_column": null,
            "grid_gap": null,
            "grid_row": null,
            "grid_template_areas": null,
            "grid_template_columns": null,
            "grid_template_rows": null,
            "height": null,
            "justify_content": null,
            "justify_items": null,
            "left": null,
            "margin": null,
            "max_height": null,
            "max_width": null,
            "min_height": null,
            "min_width": null,
            "object_fit": null,
            "object_position": null,
            "order": null,
            "overflow": null,
            "overflow_x": null,
            "overflow_y": null,
            "padding": null,
            "right": null,
            "top": null,
            "visibility": null,
            "width": null
          }
        }
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
